.updatemenu-dropdown-button-group{ max-height: 50px; overflow-x: hidden; overflow-y: auto; }
1 Introduction
1.1 About
This project results from a collaboration between the groups of E. Levy (Weizmmann, IL) and J. Schacherer (CNRS/Univ. Strasbourg).
1.2 Abstract
Cells and their proteomes constantly evolve and adapt to survive in a wide variety of ecological niches. While phenotypic diversity arises in populations on relatively short timescales (1 to 100s thousand years), longer timescales (1 to 100s million years) are involved in divergence between species. The signatures of both evolutionary processes are carved in each and every single protein sequence through accumulation of mutations, thereby reshaping cell machineries, including protein complexes, signaling pathways, and metabolic pathways for example.
Interestingly, over long timescales, different proteins accumulate mutations at markedly different rates. For example, orthologous proteins in S. cerevisiae and S. pombe share 42% (+/- 14%) sequence identity on average, but some orthologs exhibit higher conservation. For example, actins share 89.7% sequence identity, illustrating that some proteins tolerate mutations more than others. One biophysical property correlates with sequence divergence more than any other and that is protein abundance. The more abundant a protein, the more conserved its sequence. While the abundance-conservation correlation is well established, its mechanistic origin is not well understood. Toxicity associated with mutation-induced misfolding has been suggested as a possible origin, but recent works, including ours (Dubreuil et al. 2019) show it is unlikely the main driving mechanism.
To resolve mechanisms shaping protein evolution, we propose to integrate analyses of sequence evolution across timescales. On the one hand, evolution across distant species shows wide dynamic range of sequence conservation, and reflects how entire proteomes diverge with time. On the other hand, it also presents two important drawbacks. First, that functional information of one species (e.g., protein abundance) is implicitly extrapolated to other species, and such an assumption is frequently incorrect. Second, the order in which mutations occur can hardly be traced. For example, considering a gene duplication event that occurred several million years ago, it is impossible to distinguish mutations that occurred right after the duplication event from those that arose later. Remarkably, both of these limitations are resolved when measuring evolution across a population. Indeed, the short evolutionary distances seen across strains mean that functional information can be extrapolated with high confidence, and mutations can be situated in a narrow time-window. Thus, by resolving these limitations, the integration of evolutionary data across species and strains (Figure @ref(fig:project-goal)) will provide new insights into constraints that biophysics (in particular abundance) versus function place on protein evolution.
(#fig:project-goal)Integrating timescales in protein evolution from distant species to populations
2 Transcriptomics
2.1 Summary
During evolution, mutations arise in genomes, potentially affecting every gene and protein at every site over time. Degrees of sequence divergence within populations and between species reflect the interplay between several interconnected constraints (e.g., genomic, transcriptional, functional, biophysical, structural, environmental…) and underlie phenotypic diversity and organism fitness. Protein expression is the main determinant of a protein’s sequence conservation (e.g., the more a gene is expressed, the more its sequence is conserved). Although this expression-conservation correlation is well established, its mechanistic origin is not well understood.
In this perspective, analyzing variation of gene expression across every layer could reveal the key processes driving evolution among several yeast (S. cerevisiae) isolates sharing an almost identical genetic background, except for newly acquired mutations, notably in protein-coding genes (e.g. Single Nucleotide Variations).
Gene expression is a multi-layered phenomenon in which variations occur differently at each step of the protein synthesis and life (transcription, translation and protein abundance). The exploration of each of these steps is therefore essential to define the constraints affecting the final protein abundance. Given this, we sought to understand how individuals differ in terms of translational regulation and if the translational variation is related to the transcriptional variation.
Therefore, we performed a joint exploration of RNA-seq, ribosome and proteomics profiling on 8 natural Saccharomyces cerevisiae yeast isolates with widely diverse genetic and ecologic origins. We found that the evolutionary constraints on gene expression showed different magnitudes depending on the expression layer. Specifically, expression variations tended to be 10% lower at the translational level, which was related to a phenomenon called post-transcriptional buffering. This phenomenon was more likely to affect some specific genes such as essential genes as well as genes related to protein complexes. Surprisingly, genes being preferentially affected by the post-transcriptional buffering tended to be less expressed than the genes not affected by this phenomenon. Together, these results highlight that variations in gene expression were shaped differently depending on the expression level and multiple factors, including certain functional and physical constraints. We have also been collecting proteomics data for these 8 isolates, which will allow us to assess with high confidence the balance between biophysics (in particular abundance) versus function on protein evolution.
2.2 Our datasets
RNA-seq & Ribo-seq of 8 natural isolates of Saccharomyces cerevisiae
We performed both ribosome profiling and RNA sequencing on eight S. cerevisiae natural isolates coming from very diverse ecological environments and being genetically strongly different (Peter et al.,2018). These isolates were cultivated in Synthetic Complete (SC) up to mid-log phase, harvested and flash-frozen. The RNA-seq and Ribo-seq experiment was performed in collaboration with the Riken institute in Japan.
## Our dataset
isolates = c('CPI','CMP','AMH','CQC','BPL','BTT','BED','BAN')
RNA_seq_data = readRDS(here("data",'RNA_seq.RDS'))
Ribo_seq_data = readRDS(here("data",'Ribo_seq.RDS'))
n = nrow(RNA_seq_data)
strains = fread(here("data",'strains.csv'),data.table = F)
strains = strains[strains$`Standardized name`%in%isolates,c(1,2,3,4,9,10,11,16)]
library(DT)
datatable(strains, rownames = FALSE, caption = NULL,
filter = "top", escape = FALSE, style = "default",
width = NULL, height = NULL)Our data encompassed 3755 genes. The data was normalized using TPM normalization where for each gene in each isolate, we divided the raw count by the ORF length and we applied a per million factor (total read count / 1,000,000)
p1= ggcorr(RNA_seq_data ,method = c("everything", "spearman"), label = T, label_round = 2, midpoint = 0.75, limits = c(0.6,1), size = 2)+
ggtitle('RNA-seq correlation matrix')
p2=ggcorr(Ribo_seq_data ,method = c("everything", "spearman"), label = T, label_round = 2, midpoint = 0.75, limits = c(0.6,1), size = 2)+
ggtitle('Ribo-seq correlation matrix')
grid.arrange(p1,p2, ncol=2)
We also calculated a Translation efficiency value which correspond, for each gene in each isolate, to the Ribo-seq TPM value divided by the RNA-seq TPM value. In brief, this represents how well a transcript will be used for translation.
2.3 Overlap with proteomic
proteomic_WIS2 = fread(here::here('output','median-normalized.txt'), data.table = F)
n = nrow(proteomic_WIS2)
a = fread(here::here('output','proteomics-normalized-log10_intensities.tsv'))
rownames(proteomic_WIS2)= a$UNIPROT
prot_gene_name = fread(here::here('output','test.csv'), fill = T, data.table = F) %>%
mutate(Chromosome = recode(Chromosome,'(3)'=V8) ) %>%
dplyr::select(-V8) %>%
dplyr::filter(!duplicated(prot_names))
# Calculate mean of normalized intensity per strain
int_strains = pivot_longer(proteomic_WIS2 %>% rownames_to_column('uniprot'),
-uniprot, values_to='int2use',
names_to = c('strain','bio','tech','day'),
names_pattern = "([^_]+)_([^_]+)_([^_]+)_([^_]+)") %>%
group_by(uniprot,strain=toupper(strain)) %>% summarize( mean_int = mean_(int2use)) %>%
pivot_wider(id_cols=uniprot, names_from = strain, names_glue = "{strain}", values_from = mean_int )
proteomic_WIS_mean_filtered = data.frame(int_strains) %>%
drop_na %>%
dplyr::filter(uniprot %in% prot_gene_name$prot_names) %>%
left_join(prot_gene_name[,c('prot_names','ID')], by=c('uniprot'='prot_names'))
all_exp = proteomic_WIS_mean_filtered %>%
inner_join(RNA_seq_data %>% rownames_to_column('ID'), by ='ID', suffix=c('_Prot','')) %>%
inner_join(Ribo_seq_data %>% rownames_to_column('ID'), by ='ID', suffix=c('_RNA','_Ribo'))
log2_all_exp= all_exp %>% column_to_rownames('ID') %>%
# transform to log2 to RNA and Ribo (already done for normalized proteomics)
mutate( across( ends_with(c('_RNA','_Ribo')), .fns = log2) ) %>%
select(where(is.numeric))
boxplot(log2_all_exp,las=2,ylab='Gene expression (log2)')
Our proteomic data encompassed 3429 genes. The overlap between the two datasets encompassed 2632 (after several filtration of some missing values). The data between the different data sets needed to be normalized so it can be studied together (see graph above). We therefore performed quantile normalization to obtain equally distributed data (see below)
df_rank <- apply(log2_all_exp,2,rank,ties.method="min")
df_sorted <- data.frame(apply(log2_all_exp, 2, sort))
df_mean <- apply(df_sorted, 1, mean)
index_to_mean <- function(my_index, my_mean){
return(my_mean[my_index])
}
log2_all_exp_norm <- apply(df_rank, 2, index_to_mean, my_mean=df_mean) %>% as_tibble
rownames(log2_all_exp_norm) = rownames(log2_all_exp)
boxplot(log2_all_exp_norm,las=2, ylab='Normalized gene expression (log2)', title='Quantile-normalized gene expression across RNA/Ribo/Proteomics')
3 Post transcriptional buffering
The post-transcriptional buffering (PTB) is a phenomenon where transcriptional variations tend to be buffered as the expression process progresses (add ref). It has been observed in different situations… To be completed
It is possible to detect this phenomenon in different ways:
3.1 Correlation between isolate
Using the quantile normalized data, we performed Spearman correlation between all the isolate profile in each expression level separately (using the median value of the replicate for the proteomic data)
Comparing the correlation values, it reveals that the profiles tended to be more and more similar as the expression process progresses (increasing of the correlation coefficient values), bringing a first proof of the presence of the phenomenon across our 8 isolates.
library(corrr)
cor_prot= select(log2_all_exp_norm, ends_with('_Prot')) %>% correlate(method = 'sp') %>% column_to_rownames('term')
#>
#> Correlation method: 'sp'
#> Missing treated using: 'pairwise.complete.obs'
cor_rna= select(log2_all_exp_norm, ends_with('_RNA')) %>% correlate(method = 'sp') %>% column_to_rownames('term')
#>
#> Correlation method: 'sp'
#> Missing treated using: 'pairwise.complete.obs'
cor_ribo= select(log2_all_exp_norm, ends_with('_Ribo')) %>% correlate(method = 'sp') %>% column_to_rownames('term')
#>
#> Correlation method: 'sp'
#> Missing treated using: 'pairwise.complete.obs'
cor_exp= tibble(RNA = cor_rna[lower.tri(cor_rna)],
Ribo = cor_ribo[lower.tri(cor_ribo)],
Prot = cor_prot[lower.tri(cor_prot)])
df_cor_exp = pivot_longer(cor_exp, everything()) %>%
arrange(name) %>% mutate(experiment = factor(name))
pair_exp = unique(df_cor_exp$experiment) %>% as.character() %>% combn(m = 2) %>% as_tibble()
ggplot(df_cor_exp,aes(x=experiment, y=value,fill=experiment))+
geom_boxplot()+
theme_classic()+
xlab('')+
ylab('Rho')+
stat_compare_means(comparisons =pair_exp)
3.2 Variation quantification
It is also possible to explore PTB by quantifying the variation in each pairwise comparison. Basically, for each gene in each isolate pairwise comparison, we use the absolute value of the log2 transformed fold to represent the intensity of the expression variation between the 2 isolates. The more this value increases, the more a gene displays variable regulation between isolate.
We calculated this value in each pairwise comparison, for each gene and at each expression level. We found that the intensities of the variations were decreasing as long as the expression process progresses, supporting once again the presence of the PTB phenomenon in our dataset.
#remotes::install_github("TimTeaFan/dplyover")
library(dplyover)
# calculate fold change pairwise for each experiment
fc_prot = select(log2_all_exp_norm,ends_with('_Prot')) %>%
rename_with(.fn=str_remove_all,pattern = "_Prot") %>%
transmute(across2x(everything(), everything(),.fns = ~foldchange(.x,.y),.comb = 'minimal'),
experiment='Prot')%>%
pivot_longer(-experiment,names_to='pair',values_to = 'fc')
fc_rna = select(log2_all_exp_norm,ends_with('_RNA')) %>%
rename_with(.fn=str_remove_all,pattern = "_RNA") %>%
transmute(across2x(everything(), everything(),.fns = ~foldchange(.x,.y),.comb = 'minimal'),
experiment='RNA')%>%
pivot_longer(-experiment,names_to='pair',values_to = 'fc')
fc_ribo = select(log2_all_exp_norm,ends_with('_Ribo')) %>%
rename_with(.fn=str_remove_all,pattern = "_Ribo") %>%
transmute(across2x(everything(), everything(),.fns = ~foldchange(.x,.y),.comb = 'minimal'),
experiment='Ribo') %>%
pivot_longer(-experiment,names_to='pair',values_to = 'fc')
# combine all pairwise foldchange
fc_exp = bind_rows(fc_prot,fc_rna,fc_ribo) %>% arrange(experiment,pair)
pair_exp=unique(fc_exp$experiment) %>% as.character() %>% combn(m = 2) %>% as_tibble()
ggplot(fc_exp,aes(experiment, log2(fc), fill=experiment))+
geom_boxplot()+
scale_y_log10()+
stat_compare_means(comparisons = pair_exp)+
theme_classic()+
xlab('')+
ylab('|log2(FC)|')
3.3 Euclidean distances between the profiles
Euclidean distances can also give information on how variable datasets can be. We once again used the quantile normalized data to calculate Euclidean distances between the profile in each expression level.
Consistently to the previous results, we found that the distances were lower and lower as the expression process progresses, suggesting once again that the expression variations were decreased. This also supported the presence of the PTB phenomenon
# dist_exp = log2_all_exp_norm %>% add_column(id=all_exp$ID) %>%
# pivot_longer(cols = -id,
# names_to = c('strain','experiment'),
# names_pattern = '(.+)_(.+)',
# values_to = 'exp') %>%
# pivot_wider(id_cols = c('id','experiment'),names_from='strain', values_from = 'exp') %>%
# group_by(experiment) %>%
# mutate( across2x(AMH:CQC, AMH:CQC, .fns =dist, method= 'euclidean', .comb = 'minimal'))
#
b=lapply(c('RNA','Ribo','Prot'), function(i){
i<<-i
a = log2_all_exp_norm[,grep(i, colnames(log2_all_exp_norm))]
a = dist(t(a))
a = as.matrix(a)
a= a[lower.tri(a)]
return(a)
})
names(b)= c('RNA','Ribo','Prot')
b=as.data.frame(b)
b = melt(b)
#> No id variables; using all as measure variables
pair_exp=unique(b$variable) %>% as.character() %>% combn(m = 2) %>% as_tibble()
ggplot(b,aes(x=variable,y=value, fill=variable))+
geom_boxplot()+
theme_classic()+
ylab('Euclidean distance')+
xlab('')+
stat_compare_means(comparisons = pair_exp)
3.4 Variance comparison
A simple variance calculation can also give a clue on the variation intensity of gene expression. We calculated the variance in all dataset for each gene and found that it tended to be higher at the transcription level, lower at the protein abundance level (with the translation level being in the middle), which was in accordance with the previous exploration
a=lapply(c('RNA','Ribo','Prot'), function(i){
i<<-i
a = log2_all_exp_norm[,grep(i, colnames(log2_all_exp_norm))]
a=apply(a,1 ,var)
return(a)
})
names(a)= c('RNA','Ribo','Prot')
a =as.data.frame(a)
b=a
b = melt(b)
#> No id variables; using all as measure variables
pair_exp=unique(b$variable) %>% as.character() %>% combn(m = 2) %>% as_tibble()
ggplot(b,aes(x=variable,y=value, fill=variable))+
geom_boxplot()+
theme_classic()+
ylab('Variance')+
xlab('')+
stat_compare_means(comparisons = pair_exp)+
scale_y_log10()
4 Proteomics experiment
4.1 Experiment design
For this experiment, we selected eight strains from a population of 1,011 S. cerevisiae isolates, representing the overall ecological, geographical and genetic diversity (Figure @ref(fig:phylo-8-strain).
The eight strains were grown on synthetic defined media (SD). We carefully monitored the strains growth to harvest cells close to the mid-log phase (OD ~ 0.5). Then, we proceed to wash the samples in PBS and flash-freeze cell pellets for a whole-lysate proteomics profiling.
For each strains, we prepared biological replicates that originated from two distinct colonies which and two technical replicates.
(#fig:phylo-8-strain)Phylogenetic tree of 1011 cerevisiae isolates highlighting the 8 strains used for RNASeq/RiboSeq/proteomics exepriment
Altogether there were 32 samples of yeast cell pellets including 4 replicates (two biological and two technical) that we sent to the Weizmann proteomics unit. (INPCM).
The samples submitted correspond to 4mL cultures with OD ranging from 0.4-0.8.
OD of samples (table)
4.2 Sample preparation
The cell pellets were subjected to lysis and in solution tryptic digestion using the S-Trap method (by Protifi) followed by a solid phase extraction cleaning step (Oasis HLB).
4.3 Liquid chromatography mass spectrometry
The resulting peptides were analyzed using nanoflow liquid chromatography (nanoAcquity) coupled to high resolution, high mass accuracy mass spectrometry (Thermo Exploris 480). Each sample was analyzed on the instrument separately in a random order in discovery mode.
4.4 Peptide identification and quantification
Raw data was processed with MaxQuant v1.6.6.0. The data were searched with the Andromeda search engine against a database containing protein sequences of Saccharomyces cerevisiae as downloaded from Uniprot.org, and appended with common lab protein contaminants.
The following modifications were defined for the search: Fixed modification- cysteine carbamidomethylation. Variable modifications- methionine oxidation and protein N-terminal acetylation.
The quantitative comparisons were calculated using Perseus v1.6.0.7. Decoy hits were filtered out and only proteins that were detected in at least two replicates of at least one experimental group were kept.
4.5 Loading Mass Spectrometry Results
First, we read the MaxQuant output file (CSV/XLSX format) containing all the hits. A hit corresponds to the set of peptides that were matched to a single protein group. Note: protein groups may contain one or multiple identified proteins (including contaminants)
#> Total number of proteins hits: 3635
| Sample Names: | |
|---|---|
| 1 | amh_1_1_12 |
| 2 | amh_1_2_10 |
| 3 | amh_2_1_12 |
| 4 | amh_2_2_12 |
| 5 | ban_1_1_12 |
| 6 | ban_1_2_10 |
| 7 | ban_2_1_10 |
| 8 | ban_2_2_10 |
| 9 | bed_1_1_12 |
| 10 | bed_1_2_10 |
| 11 | bed_2_1_12 |
| 12 | bed_2_2_10 |
| 13 | bpl_1_1_10 |
| 14 | bpl_1_2_12 |
| 15 | bpl_2_1_10 |
| 16 | bpl_2_2_12 |
| 17 | btt_1_1_12 |
| 18 | btt_1_2_10 |
| 19 | btt_2_1_12 |
| 20 | btt_2_2_12 |
| 21 | cmp_1_1_10 |
| 22 | cmp_1_2_12 |
| 23 | cmp_2_1_10 |
| 24 | cmp_2_2_10 |
| 25 | cpi_1_1_12 |
| 26 | cpi_1_2_10 |
| 27 | cpi_2_1_12 |
| 28 | cpi_2_2_12 |
| 29 | cqc_1_1_10 |
| 30 | cqc_1_2_12 |
| 31 | cqc_2_1_10 |
| 32 | cqc_2_2_10 |
Statistics of proteomics hits:
3635 total hits
- 20 contaminants
- 53 reversed sequences
- 3562 proteins identified:
- 3485 unique proteins
- 77 multiple proteins (with 77 duplicated pairs)
In this proteomics experiment, a large number of proteins were identified and quantified.
The quantified cellular proteome roughly amounts to half of the reference proteome sequences of the yeast S. cerevisiae S288C and close 70% of the cytoplasmic proteins detectable through MS-profiling of a whole-cell lysate.
5 Data processing
Before normalizing intensities, we first need to discard some hits because they are not suitable for analysis:
- contaminants (identifier starts with CON)
- reversed sequences (identifier starts with REV)
- multiple hits (peptides matching several proteins)
- not enough unique peptides (low confidence identification (< 2 peptides))
# Filter hits
ms1=filter_hits(ms0)
#> Discarding problematic hits...
#> * 104 = less than 2 unique peptides
#> * 20 = contaminated hits
#> * 52 = reversed sequences
#> * 57 = multi-protein hits
#> -----------------------------------------
#> -> 206 hits eliminated
#> => 3429 remaining hits for analysis
int_raw=read_maxquant(MAXQUANT.files$proteinGroups, zero.to.na = T, int_type = 'Intensity', pep_type = 'Unique peptides') %>%
filter_hits(verbose = F) %>%
dplyr::select(uniprot=majority_protein_i_ds,starts_with("intensity_")) %>%
#mutate(across(where(is.integer64), ~as.integer(.))) %>% # make sure to use numeric and not integer
as.data.frame() %>% column_to_rownames('uniprot') %>%
rename_with(everything(),.fn=gsub, pattern='intensity_', replacement='') %>%
rename_with(.fn = str_to_upper)
#> Total number of proteins hits: 36355.1 Processing quantified intensities
Following the aim of this experiment, we wish to compare the variation of protein expression between strains.
First, we compute the average intensities over the replicates.
Then, we discard proteins hits with missing values for XX strains or for YY replicates types of media.
Finally, we normalize the intensities by transforming to log2 and subtracting each sample’s median. (cf formula below)
Normalization method = Equalizing medians
\[ norm.int_{sample} = log2(raw.int_{sample}) - log2(median_{sample}) \] We filter out hits that contains missing values for either strains or replicates
# Process intensities
is.integer64 <- function(x) inherits(x, "integer64")
int_lfq = ms1 %>% dplyr::select(uniprot=majority_protein_i_ds,starts_with("lfq")) %>%
#mutate(across(where(is.integer64), ~as.integer(.))) %>% # make sure to use numeric and not integer
as.data.frame() %>% column_to_rownames('uniprot') %>%
rename_with(everything(),.fn=gsub, pattern='lfq_intensity_', replacement='') %>%
rename_with(.fn = str_to_upper)
int_md_norm = center_intensities(int_lfq, center='median', tolog2=T) %>%
as.data.frame() %>%
rename_with(everything(),.fn =gsub, pattern='lfq_intensity_', replacement='') %>%
rename_with(.fn = str_to_upper)
#> Normalize log2-transformed intensities by the samples median ...
# TRYING DIFFERENT NORMALIZATIONS SCHEME (NOT READY)
INT_NORM = normalize_intensities(int = int_raw, design = df.group)
#> Input data checked. All fields are valid.
#> Sample check: More than one sample group found
#> Sample replication check: All samples have replicates
#> No RT column found, skipping RT processing
#> No RT column specified (column named 'RT') or option not specified Skipping RT normalization.
TAB_NORM = kbl(names(INT_NORM@normalizations),row.names = T,col.names = 'Normalizations:',position = 'left') %>%
kable_paper("striped", full_width = F) %>%
kable_styling(position='left')
int_norm = int_md_norm %>% rownames_to_column('uniprot')
int_norm_ids = int_norm %>% dplyr::left_join(sc_identifiers,by=c('uniprot'='UNIPROT')) %>% filter(!duplicated(uniprot))
#compare_exp(log2(int_raw),int_loess_norm, all=T)
long_int_norm = pivot_longer(int_norm , cols=-uniprot, values_to='int2use',
names_to = c('strain','bio','tech','day'),
names_pattern = "([^_]+)_([^_]+)_([^_]+)_([^_]+)") %>%
group_by(strain,uniprot) %>% mutate(na_rep = sum.na(int2use))
# Intensities across strains (default is average)
int_by_strain = pivot_wider(long_int_norm ,
id_cols=uniprot,
names_from = 'strain',
names_glue = "{strain}",
values_from = "int2use",
values_fn=mean_
)
na_by_strain = pivot_wider(long_int_norm,
id_cols=uniprot,
names_from = c('strain'),
names_glue = "na_rep_{strain}",
values_from = "int2use",
values_fn=sum.na
)
df_strains= left_join(int_by_strain,na_by_strain) %>%
rowwise %>%
mutate( na_strains = sum.na(c_across(cols = starts_with('lfq_int'))) )
#> Joining, by = "uniprot"
int_all = int_norm %>% column_to_rownames('uniprot') %>% as.data.frame()
# Removing hits with missing values for more than one strain (using average intensities over replicates)
ms2= df_strains %>% filter(na_strains < 1)
int_filt_strains = ms2 %>% dplyr::select(-starts_with('na')) %>% column_to_rownames('uniprot') %>% as.data.frame()
# Save processed & normalized intensities data
#write_rds(df_strains,here::here('output',sprintf("%02d-normalized-intensities.rds",chap_cur)))library(MsCoreUtils)
#>
#> Attaching package: 'MsCoreUtils'
#> The following objects are masked from 'package:data.table':
#>
#> %between%, between
#> The following object is masked from 'package:dplyr':
#>
#> between
#> The following object is masked from 'package:stats':
#>
#> smooth
int_norm_nona = int_norm %>% drop_na()
int_norm_bpca = impute_bpca(int_norm) %>% as_tibble() %>% add_column(uniprot = int_norm_ids$uniprot)
#> Loading required namespace: pcaMethods
#colSums(is.na(int_norm_nona))
#draw_scatterplots(int_norm_nona)6 Quality control
First, we retrieve the LFQ-intensities of each protein hit (label-free quantitation).
# get lfq-peptide intensities (lfq=label-free quantitation)
intensities = ms1 %>% dplyr::select(uniprot=majority_protein_i_ds,starts_with("lfq"))
# Convert intensities to long format
long_int_all = get_long_intensities(intensities) %>% mutate(int2use = log10_int)6.1 Total sample intensities
#> Joining, by = "sample"
(#fig:tot_int)Total intensities per sample
6.2 Boxplot of sample raw intensities
We want to inspect the distribution of peptide intensities between strains. In addition, we will also observe in how many replicates each hit was quantified.
The distribution of peptide intensities between all strains does not show strong differences of expression:
(#fig:boxplot-int_ub)Distribution of expression for ubiquitous hits (i.e. detected in all strains)
| strain name | AMH | BAN | BED | BPL | BTT | CMP | CPI | CQC |
|---|---|---|---|---|---|---|---|---|
| median_exp | 7.9 | 7.92 | 7.92 | 7.9 | 7.92 | 7.95 | 7.91 | 7.97 |
| # replicates | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| median_exp | 6.82 | 6.92 | 7.04 | 8 |
On average, peptide intensities are higher when a hit is found in more than one replicate
6.3 Count of missing hits per sample
The following barplot shows the range of missing hits per samples.
#> Joining, by = "sample"
(#fig:missing-hit-samples)Count of missing hits per samples
CQC and CMP are the two strains containing the most missing values.
Particularly, the first biological replicate of those strains have twice more missing protein intensities (509 hits so 10-16% NAs) while the rest of the samples have on average about only 231 hits missing (4% to 9% NAs).
We then look at hits quantified as a function of the number of strains in which at least one peptide was quantified. (non-ubiquitous hits)
(#fig:missing-hit)Distribution of expression for hits not detected in all strains
| # proteins | # expressing strains | median exp. (log10) | % quantified | cum. % | cum. % (decreasing) |
|---|---|---|---|---|---|
| 2280 | 8 | 8.20 | 66.5% | 100.0% | 66.5% |
| 305 | 7 | 7.45 | 8.9% | 33.5% | 75.4% |
| 209 | 6 | 7.24 | 6.1% | 24.6% | 81.5% |
| 144 | 5 | 7.12 | 4.2% | 18.5% | 85.7% |
| 111 | 4 | 7.02 | 3.2% | 14.3% | 88.9% |
| 96 | 3 | 6.99 | 2.8% | 11.1% | 91.7% |
| 90 | 2 | 6.96 | 2.6% | 8.3% | 94.3% |
| 84 | 1 | 6.89 | 2.4% | 5.7% | 96.8% |
| 110 | 0 | 6.78 | 3.2% | 3.2% | 100.0% |
Proteins expressed across less strains have lower median peptide intensity. Nevertheless, almost two thirds (66.5%) of the protein hits were detected in all strains. About 86% protein hits were expressed in the majority of the strains (at least 5 out of 8 strains).
6.4 Missing hits across samples
Furthermore, we can check whether the proportion of missing values is equally distributed among all samples.
(#fig:na-samples)Number of quantified intensities across samples
6.5 Coefficient of variations
Coefficient of variations correspond to the percent of variance relative to the mean. The following boxplots show how variable protein expression is, across strains or across biological replicates, before and after normalization.
(#fig:boxplot-cv)Coefficient of variations (%) in each sample and per strains before and after processing data (filtering and normalization)
| strain | min | q25 | md | q75 | max |
|---|---|---|---|---|---|
| amh | 0.00 | 0.46 | 0.74 | 1.20 | 11.96 |
| ban | 0.01 | 0.52 | 0.81 | 1.29 | 15.35 |
| bed | 0.01 | 0.44 | 0.72 | 1.18 | 13.78 |
| bpl | 0.01 | 0.47 | 0.79 | 1.33 | 17.42 |
| btt | 0.01 | 0.52 | 0.81 | 1.33 | 14.15 |
| cmp | 0.01 | 0.77 | 1.24 | 1.94 | 14.66 |
| cpi | 0.01 | 0.42 | 0.67 | 1.09 | 13.47 |
| cqc | 0.01 | 1.24 | 1.91 | 2.81 | 21.60 |
| biological_rep | min | q25 | md | q75 | max |
|---|---|---|---|---|---|
| amh-1 | 0 | 0.18 | 0.39 | 0.79 | 20.06 |
| amh-2 | 0 | 0.19 | 0.39 | 0.80 | 8.71 |
| ban-1 | 0 | 0.29 | 0.60 | 1.12 | 14.13 |
| ban-2 | 0 | 0.23 | 0.49 | 0.90 | 16.44 |
| bed-1 | 0 | 0.15 | 0.36 | 0.71 | 10.64 |
| bed-2 | 0 | 0.30 | 0.68 | 1.25 | 16.97 |
| bpl-1 | 0 | 0.19 | 0.44 | 0.89 | 16.35 |
| bpl-2 | 0 | 0.20 | 0.48 | 0.96 | 19.18 |
| btt-1 | 0 | 0.29 | 0.59 | 1.10 | 15.01 |
| btt-2 | 0 | 0.17 | 0.39 | 0.80 | 15.62 |
| cmp-1 | 0 | 0.33 | 0.70 | 1.27 | 21.19 |
| cmp-2 | 0 | 0.16 | 0.36 | 0.77 | 10.67 |
| cpi-1 | 0 | 0.18 | 0.42 | 0.78 | 16.48 |
| cpi-2 | 0 | 0.18 | 0.39 | 0.78 | 21.08 |
| cqc-1 | 0 | 0.54 | 1.24 | 2.38 | 18.59 |
| cqc-2 | 0 | 0.38 | 0.86 | 1.52 | 14.99 |
| strain | min | q25 | md | q75 | max |
|---|---|---|---|---|---|
| AMH | 0.04 | 1.26 | 2.17 | 3.94 | 49.23 |
| BAN | 0.08 | 1.34 | 2.30 | 4.30 | 60.07 |
| BED | 0.11 | 1.22 | 2.17 | 3.96 | 50.95 |
| BPL | 0.03 | 1.28 | 2.30 | 4.45 | 75.28 |
| BTT | 0.02 | 1.36 | 2.39 | 4.47 | 58.83 |
| CMP | 0.06 | 2.22 | 3.74 | 6.40 | 141.42 |
| CPI | 0.05 | 1.13 | 2.02 | 3.60 | 50.19 |
| CQC | 0.18 | 3.31 | 5.60 | 9.20 | 90.23 |
| biological_rep | min | q25 | md | q75 | max |
|---|---|---|---|---|---|
| AMH-1 | 0 | 0.47 | 1.11 | 2.53 | 63.80 |
| AMH-2 | 0 | 0.51 | 1.12 | 2.49 | 39.87 |
| BAN-1 | 0 | 0.64 | 1.55 | 3.45 | 58.00 |
| BAN-2 | 0 | 0.63 | 1.41 | 2.78 | 63.59 |
| BED-1 | 0 | 0.43 | 1.03 | 2.27 | 44.53 |
| BED-2 | 0 | 0.88 | 1.99 | 3.91 | 71.41 |
| BPL-1 | 0 | 0.49 | 1.20 | 2.73 | 51.93 |
| BPL-2 | 0 | 0.52 | 1.34 | 3.03 | 91.56 |
| BTT-1 | 0 | 0.57 | 1.42 | 3.35 | 58.80 |
| BTT-2 | 0 | 0.48 | 1.13 | 2.54 | 45.46 |
| CMP-1 | 0 | 0.91 | 2.04 | 4.00 | 71.14 |
| CMP-2 | 0 | 0.44 | 1.03 | 2.35 | 38.36 |
| CPI-1 | 0 | 0.52 | 1.18 | 2.48 | 52.70 |
| CPI-2 | 0 | 0.49 | 1.10 | 2.49 | 95.71 |
| CQC-1 | 0 | 1.64 | 3.54 | 7.43 | 63.55 |
| CQC-2 | 0 | 1.08 | 2.48 | 4.55 | 63.29 |
6.6 Expression distributions
To highlight the strength of normalization, we also show the density distribution of expression before and after normalization using each of the following normalization methods:
| Normalizations: | |
|---|---|
| 1 | log2 |
| 2 | VSN |
| 3 | GI |
| 4 | median |
| 5 | mean |
| 6 | Quantile |
| 7 | CycLoess |
| 8 | RLR |
draw_normalization_density(int_raw,int_lfq,df.group)
#> Input data checked. All fields are valid.
#> Sample check: More than one sample group found
#> Sample replication check: All samples have replicates
#> No RT column found, skipping RT processing
#> No RT column specified (column named 'RT') or option not specified Skipping RT normalization.
#> Joining, by = "sample"
#> Joining, by = "sample"
6.7 Compare all-vs-all expression
6.7.1 Scatterplots
# scatterplots between all samples
scmat_all=draw_scatterplots(datain=int_all)
print(scmat_all)
6.7.2 Heatmap correlation
# heatmap correlation
cs_all=compute_samples_correlation(int_all)
#> Compute pairwise samples correlation (Spearman)...
#>
#> Correlation method: 'spearman'
#> Missing treated using: 'pairwise.complete'
COR_RANGE = range( cs_all[row(cs_all) == (col(cs_all) - 1)] )
by_sample = df.group %>% column_to_rownames("sample")
hm_all=draw_heatmap_samples(mcor = cs_all,df.group=by_sample,col.group = col.group)
The heatmap correlations of all samples show the high correlation of expression between replicates and between most strains: [0.824 - 0.992]
The expression from the 1st biological replicates of strains CQC and CMP seem slighlty less correlated to the other samples.
6.7.3 Principle Component Analysis
Finally, the PCA reveals the distance between each sample.
int_all_scaled = scale(int_all,center=T, scale = T)
make_pca(na.omit(int_all_scaled), with_labels=T,col_by_group=1:4)
7 Strains expression comparison
7.1 Scatterplots
Initially, we can look at the scatterplots of intensities all-versus-all samples.
# scatterplots between strains
scmat_strains=draw_scatterplots(datain=int_filt_strains)
print(scmat_strains)
7.2 Heatmap Correlation
We then compute spearman rank correlations of intensities between all samples.
# heatmap correlation
cs_strains=compute_samples_correlation(int_filt_strains)
#> Compute pairwise samples correlation (Spearman)...
#>
#> Correlation method: 'spearman'
#> Missing treated using: 'pairwise.complete'
COR_RANGE = range( cs_strains[row(cs_strains) == (col(cs_strains) - 1)] )
hm_strains=draw_heatmap_samples(mcor = cs_strains,df.group=c(),col.group = col.group)
The heatmap correlations show the high correlation of expression between strains: [0.942 - 0.977]
7.3 Principle Component Analysis
Finally, the PCA reveals the distance between each sample.
int_scaled_strains = scale(int_filt_strains,center=T, scale = T)
make_pca(na.omit(int_scaled_strains), with_labels=F,col_by_group=1:2)
8 Differential Expression
We then compare the protein expression between strains for each pairwise comparison (28 unique pairwise combinations)
To detect differentially expressed genes, we use the limma analysis on normalized protein expression.
8.1 Volcano plots
First, we can look at the volcano plots (log-foldchange vs qvalue) for each unique pairwise comparison.
#ind_na_rows = find_na_rows(int_norm,as.indices = T)
df_imputed = tibble( uniprot = rownames(int_norm), is_imputed = rowSums( is.na(int_norm))>0)
#int_norm_bpca$imputed = rowSums( is.na(int_norm))>0
volcPlot(INPUT=int_norm_bpca, IMPUTED=df_imputed, MIN_LFC=2, MIN_PVAL=0.01, WHICH='both', TOPN = 20)
#> Joining, by = c("ID", "pValue", "qValue", "EffectSize",
#> "comparison", "sig", "log10_qvalue", "SGD", "ORF",
#> "UNIPROT", "GENENAME", "is_imputed")